nhane data
NHANES-GCP: Leveraging the Google Cloud Platform and BigQuery ML for reproducible machine learning with data from the National Health and Nutrition Examination Survey
Katz, B. Ross, Khan, Abdul, York-Winegar, James, Titus, Alexander J.
Summary: NHANES, the National Health and Nutrition Examination Survey, is a program of studies led by the Centers for Disease Control and Prevention (CDC) designed to assess the health and nutritional status of adults and children in the United States (U.S.). NHANES data is frequently used by biostatisticians and clinical scientists to study health trends across the U.S., but every analysis requires extensive data management and cleaning before use and this repetitive data engineering collectively costs valuable research time and decreases the reproducibility of analyses. Here, we introduce NHANES-GCP, a Cloud Development Kit for Terraform (CDKTF) Infrastructure-as-Code (IaC) and Data Build Tool (dbt) resources built on the Google Cloud Platform (GCP) that automates the data engineering and management aspects of working with NHANES data. With current GCP pricing, NHANES-GCP costs less than $2 to run and less than $15/yr of ongoing costs for hosting the NHANES data, all while providing researchers with clean data tables that can readily be integrated for large-scale analyses. We provide examples of leveraging BigQuery ML to carry out the process of selecting data, integrating data, training machine learning and statistical models, and generating results all from a single SQL-like query. NHANES-GCP is designed to enhance the reproducibility of analyses and create a well-engineered NHANES data resource for statistics, machine learning, and fine-tuning Large Language Models (LLMs). Availability and implementation" NHANES-GCP is available at https://github.com/In-Vivo-Group/NHANES-GCP
- North America > United States > California > Los Angeles County > Los Angeles (0.15)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Information Technology > Services (1.00)
- Health & Medicine > Public Health (1.00)
- Education > Health & Safety > School Nutrition (0.87)
Application of multiview techniques to NHANES dataset
Research into disease-related health variables typically involve choosing health variables and conditions, and using statistical methods to study the strength of association of the variables with the condition [9]. These are then used to confirm known or suspected relationships between the behavioural/health factors or disease conditions. There may be information about health status that may be gleaned by considering different aspects of an individual's data, and investigating possible relationships between the variables. Representations that capture these relationships can be useful in predicting presence or risk level of medical conditions. The National Health and Nutrition Examination Survey (NHANES) dataset provides data on health measurements, taken from survey participants, comprising different categories including demographics, laboratory tests and physical measurements.
- North America > United States (0.68)
- North America > Mexico (0.04)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (1.00)
- Health & Medicine > Consumer Health (1.00)
- Health & Medicine > Therapeutic Area > Immunology (0.69)